Team - Triston Hudgins, Shijo Joseph, Osman Kanteh, Douglas Yip
## Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
import seaborn as sns
import plotly.express as px
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler
from matplotlib.pyplot import scatter
import plotly
from plotly.graph_objs import Scatter, Marker, Layout, layout, XAxis, YAxis, Bar, Line
## Decision tree setup
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# load the airline satisfaction dataset
df = pd.read_csv('https://raw.githubusercontent.com/dk28yip/MSDS7331/main/airline.csv') # read in the csv file
df.head()
# reduced the sample from 100,000 to 30,000 rows, as a few of us had computer performance issues
df = df.sample(n=30000)
# Any missing values in the dataset
def plot_missingness(df: pd.DataFrame = df) -> None:
    nan_df = pd.DataFrame(df.isna().sum()).reset_index()
    nan_df.columns = ['Column', 'NaN_Count']
    nan_df['NaN_Count'] = nan_df['NaN_Count'].astype('int')
    nan_df['NaN_%'] = round(nan_df['NaN_Count'] / df.shape[0] * 100, 4)
    nan_df['Type'] = 'Missingness'
    nan_df.sort_values('NaN_%', inplace=True)
    # Add a matching completeness row for each column
    for i in range(nan_df.shape[0]):
        complete_df = pd.DataFrame([nan_df.loc[i, 'Column'],
                                    df.shape[0] - nan_df.loc[i, 'NaN_Count'],
                                    100 - nan_df.loc[i, 'NaN_%'],
                                    'Completeness']).T
        complete_df.columns = ['Column', 'NaN_Count', 'NaN_%', 'Type']
        complete_df['NaN_%'] = complete_df['NaN_%'].astype('int')
        complete_df['NaN_Count'] = complete_df['NaN_Count'].astype('int')
        nan_df = pd.concat([nan_df, complete_df], sort=True)
    nan_df = nan_df.rename(columns={"Column": "Feature", "NaN_%": "Missing %"})
    # Missingness plot
    fig = px.bar(nan_df,
                 x='Feature',
                 y='Missing %',
                 title=f"Missingness Plot (N={df.shape[0]})",
                 color='Type',
                 opacity=0.6,
                 color_discrete_sequence=['red', '#808080'],
                 width=800,
                 height=800)
    fig.show()

plot_missingness(df)
print("Missing 99 values in the 'Arrival Delay in Minutes' column; approximately 0.33%.")
Missing 99 values in the 'Arrival Delay in Minutes' column; approximately 0.33%.
ID was removed from the dataset, as it is only a unique identifier for each passenger.
df["GenderNumeric"] = (df["Gender"]=="Male").astype(int)
df["CustomerTypeNumeric"] = (df["Customer Type"]=="Loyal Customer").astype(int)
df["TypeofTravelNumeric"] = (df["Type of Travel"]=="Personal Travel").astype(int)
df["ClassNumeric"] = df["Class"]
df["ClassNumeric"].replace(['Eco', 'Eco Plus', 'Business'], [0, 1, 2], inplace=True)
df["Arrival Delay in Minutes"]= df["Arrival Delay in Minutes"].fillna(0)
dfclean = df.drop(columns=['id'])
dfclean.isnull().sum()  # double-check the missing values; 'Arrival Delay in Minutes' is now 0 after the fill above
Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Inflight wifi service                0
Departure/Arrival time convenient    0
Ease of Online booking               0
Gate location                        0
Food and drink                       0
Online boarding                      0
Seat comfort                         0
Inflight entertainment               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Inflight service                     0
Cleanliness                          0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
satisfaction                         0
GenderNumeric                        0
CustomerTypeNumeric                  0
TypeofTravelNumeric                  0
ClassNumeric                         0
dtype: int64
# Fill any remaining missing values with the median (a no-op here, since the NaNs
# in 'Arrival Delay in Minutes' were already replaced with 0 above)
dfclean["Arrival Delay in Minutes"].fillna(dfclean["Arrival Delay in Minutes"].median(), inplace=True)
dfclean.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 30000.0 | 39.333333 | 15.154087 | 7.0 | 27.0 | 40.0 | 51.0 | 85.0 |
| Flight Distance | 30000.0 | 1185.172233 | 996.070790 | 31.0 | 414.0 | 836.5 | 1726.0 | 4983.0 |
| Inflight wifi service | 30000.0 | 2.732333 | 1.326479 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Departure/Arrival time convenient | 30000.0 | 3.061733 | 1.523133 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Ease of Online booking | 30000.0 | 2.745933 | 1.394506 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Gate location | 30000.0 | 2.967167 | 1.272874 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Food and drink | 30000.0 | 3.207767 | 1.325993 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Online boarding | 30000.0 | 3.256067 | 1.346263 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Seat comfort | 30000.0 | 3.439733 | 1.317887 | 1.0 | 2.0 | 4.0 | 5.0 | 5.0 |
| Inflight entertainment | 30000.0 | 3.356600 | 1.332377 | 0.0 | 2.0 | 4.0 | 4.0 | 5.0 |
| On-board service | 30000.0 | 3.393267 | 1.291664 | 1.0 | 2.0 | 4.0 | 4.0 | 5.0 |
| Leg room service | 30000.0 | 3.342067 | 1.316098 | 0.0 | 2.0 | 4.0 | 4.0 | 5.0 |
| Baggage handling | 30000.0 | 3.625367 | 1.190461 | 1.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| Checkin service | 30000.0 | 3.305067 | 1.265512 | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| Inflight service | 30000.0 | 3.643867 | 1.175875 | 1.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| Cleanliness | 30000.0 | 3.293233 | 1.312036 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Departure Delay in Minutes | 30000.0 | 14.674167 | 37.877541 | 0.0 | 0.0 | 0.0 | 12.0 | 978.0 |
| Arrival Delay in Minutes | 30000.0 | 14.978233 | 37.996260 | 0.0 | 0.0 | 0.0 | 13.0 | 970.0 |
| GenderNumeric | 30000.0 | 0.490367 | 0.499916 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| CustomerTypeNumeric | 30000.0 | 0.818067 | 0.385796 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| TypeofTravelNumeric | 30000.0 | 0.313800 | 0.464044 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ClassNumeric | 30000.0 | 1.023467 | 0.962954 | 0.0 | 0.0 | 1.0 | 2.0 | 2.0 |
dfclean.corr()
| | Age | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | ... | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | GenderNumeric | CustomerTypeNumeric | TypeofTravelNumeric | ClassNumeric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1.000000 | 0.088761 | 0.010898 | 0.035966 | 0.017280 | -0.008419 | 0.022636 | 0.209064 | 0.159351 | 0.069824 | ... | -0.050521 | 0.033044 | -0.059903 | 0.056137 | -0.010983 | -0.013793 | 0.009616 | 0.276444 | -0.045820 | 0.134797 |
| Flight Distance | 0.088761 | 1.000000 | -0.000179 | -0.028230 | 0.059409 | 0.006073 | 0.062344 | 0.210604 | 0.154774 | 0.128284 | ... | 0.065263 | 0.072926 | 0.053089 | 0.092888 | 0.007348 | 0.004873 | 0.002458 | 0.225608 | -0.265108 | 0.451733 |
| Inflight wifi service | 0.010898 | -0.000179 | 1.000000 | 0.347462 | 0.717276 | 0.331487 | 0.134376 | 0.455373 | 0.124499 | 0.209159 | ... | 0.120423 | 0.038001 | 0.110281 | 0.136155 | -0.024201 | -0.025128 | 0.011443 | 0.007496 | -0.099762 | 0.030675 |
| Departure/Arrival time convenient | 0.035966 | -0.028230 | 0.347462 | 1.000000 | 0.439865 | 0.437732 | 0.006044 | 0.070013 | 0.011336 | -0.003539 | ... | 0.077063 | 0.095669 | 0.076152 | 0.017797 | 0.000105 | -0.003089 | 0.014659 | 0.197410 | 0.265753 | -0.102602 |
| Ease of Online booking | 0.017280 | 0.059409 | 0.717276 | 0.439865 | 1.000000 | 0.449523 | 0.037039 | 0.405594 | 0.035889 | 0.050755 | ... | 0.039327 | 0.012339 | 0.035587 | 0.022064 | -0.010817 | -0.012432 | 0.008634 | 0.015385 | -0.122766 | 0.101451 |
| Gate location | -0.008419 | 0.006073 | 0.331487 | 0.437732 | 0.449523 | 1.000000 | -0.003463 | 0.001483 | -0.006495 | -0.002177 | ... | -0.002530 | -0.034383 | -0.007300 | -0.005852 | 0.006564 | 0.005248 | -0.008748 | -0.004155 | -0.025673 | 0.002478 |
| Food and drink | 0.022636 | 0.062344 | 0.134376 | 0.006044 | 0.037039 | -0.003463 | 1.000000 | 0.237150 | 0.575546 | 0.622686 | ... | 0.031572 | 0.083065 | 0.033219 | 0.654814 | -0.034701 | -0.038656 | 0.005760 | 0.058515 | -0.067930 | 0.084525 |
| Online boarding | 0.209064 | 0.210604 | 0.455373 | 0.070013 | 0.405594 | 0.001483 | 0.237150 | 1.000000 | 0.421982 | 0.283006 | ... | 0.081116 | 0.191832 | 0.069611 | 0.328474 | -0.034074 | -0.036619 | -0.036949 | 0.190336 | -0.224619 | 0.322052 |
| Seat comfort | 0.159351 | 0.154774 | 0.124499 | 0.011336 | 0.035889 | -0.006495 | 0.575546 | 0.421982 | 1.000000 | 0.604920 | ... | 0.073030 | 0.184772 | 0.067718 | 0.673116 | -0.033966 | -0.036365 | -0.026255 | 0.160240 | -0.122023 | 0.222807 |
| Inflight entertainment | 0.069824 | 0.128284 | 0.209159 | -0.003539 | 0.050755 | -0.002177 | 0.622686 | 0.283006 | 0.604920 | 1.000000 | ... | 0.372632 | 0.116432 | 0.406043 | 0.691391 | -0.031402 | -0.035593 | 0.004107 | 0.106310 | -0.151017 | 0.188207 |
| On-board service | 0.047301 | 0.099231 | 0.112452 | 0.067752 | 0.032247 | -0.029634 | 0.057255 | 0.142793 | 0.125524 | 0.417427 | ... | 0.514255 | 0.250339 | 0.552803 | 0.120820 | -0.021523 | -0.026300 | 0.010926 | 0.044113 | -0.056015 | 0.197146 |
| Leg room service | 0.035710 | 0.127854 | 0.148875 | 0.011067 | 0.100626 | -0.012060 | 0.024944 | 0.113150 | 0.100312 | 0.295708 | ... | 0.373361 | 0.152777 | 0.367549 | 0.090902 | 0.009297 | 0.004727 | 0.043565 | 0.050027 | -0.127187 | 0.195486 |
| Baggage handling | -0.050521 | 0.065263 | 0.120423 | 0.077063 | 0.039327 | -0.002530 | 0.031572 | 0.081116 | 0.073030 | 0.372632 | ... | 1.000000 | 0.239732 | 0.623748 | 0.095028 | -0.009188 | -0.014576 | 0.041854 | -0.022627 | -0.027709 | 0.161611 |
| Checkin service | 0.033044 | 0.072926 | 0.038001 | 0.095669 | 0.012339 | -0.034383 | 0.083065 | 0.191832 | 0.184772 | 0.116432 | ... | 0.239732 | 1.000000 | 0.240392 | 0.173525 | -0.021182 | -0.024776 | 0.011548 | 0.035507 | 0.023392 | 0.150809 |
| Inflight service | -0.059903 | 0.053089 | 0.110281 | 0.076152 | 0.035587 | -0.007300 | 0.033219 | 0.069611 | 0.067718 | 0.406043 | ... | 0.623748 | 0.240392 | 1.000000 | 0.088736 | -0.052682 | -0.058853 | 0.043555 | -0.026731 | -0.021220 | 0.147718 |
| Cleanliness | 0.056137 | 0.092888 | 0.136155 | 0.017797 | 0.022064 | -0.005852 | 0.654814 | 0.328474 | 0.673116 | 0.691391 | ... | 0.095028 | 0.173525 | 0.088736 | 1.000000 | -0.018905 | -0.022607 | 0.003672 | 0.083864 | -0.083686 | 0.134758 |
| Departure Delay in Minutes | -0.010983 | 0.007348 | -0.024201 | 0.000105 | -0.010817 | 0.006564 | -0.034701 | -0.034074 | -0.033966 | -0.031402 | ... | -0.009188 | -0.021182 | -0.052682 | -0.018905 | 1.000000 | 0.958246 | -0.003874 | -0.007068 | -0.010379 | -0.010214 |
| Arrival Delay in Minutes | -0.013793 | 0.004873 | -0.025128 | -0.003089 | -0.012432 | 0.005248 | -0.038656 | -0.036619 | -0.036365 | -0.035593 | ... | -0.014576 | -0.024776 | -0.058853 | -0.022607 | 0.958246 | 1.000000 | -0.007523 | -0.005648 | -0.011786 | -0.014097 |
| GenderNumeric | 0.009616 | 0.002458 | 0.011443 | 0.014659 | 0.008634 | -0.008748 | 0.005760 | -0.036949 | -0.026255 | 0.004107 | ... | 0.041854 | 0.011548 | 0.043555 | 0.003672 | -0.003874 | -0.007523 | 1.000000 | 0.029110 | 0.003404 | 0.008502 |
| CustomerTypeNumeric | 0.276444 | 0.225608 | 0.007496 | 0.197410 | 0.015385 | -0.004155 | 0.058515 | 0.190336 | 0.160240 | 0.106310 | ... | -0.022627 | 0.035507 | -0.026731 | 0.083864 | -0.007068 | -0.005648 | 0.029110 | 1.000000 | 0.311644 | 0.102477 |
| TypeofTravelNumeric | -0.045820 | -0.265108 | -0.099762 | 0.265753 | -0.122766 | -0.025673 | -0.067930 | -0.224619 | -0.122023 | -0.151017 | ... | -0.027709 | 0.023392 | -0.021220 | -0.083686 | -0.010379 | -0.011786 | 0.003404 | 0.311644 | 1.000000 | -0.543666 |
| ClassNumeric | 0.134797 | 0.451733 | 0.030675 | -0.102602 | 0.101451 | 0.002478 | 0.084525 | 0.322052 | 0.222807 | 0.188207 | ... | 0.161611 | 0.150809 | 0.147718 | 0.134758 | -0.010214 | -0.014097 | 0.008502 | 0.102477 | -0.543666 | 1.000000 |
22 rows × 22 columns
f, ax = plt.subplots(figsize=[18, 13])
sns.heatmap(dfclean.corr(), annot=True, fmt=".2f", ax=ax, cmap="bwr")
ax.set_title("Correlation Matrix", fontsize=20)
plt.show()
- Very strong correlations: 0.8 to 1.0 (or -0.8 to -1.0)
- Strong correlations: 0.6 to 0.8 (or -0.6 to -0.8)
- Moderate correlations: 0.4 to 0.6 (or -0.4 to -0.6)
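Given these cutoffs, the strongly correlated pairs can be pulled out of the correlation matrix programmatically rather than read off the heatmap. A minimal sketch of one way to do it, demonstrated on a small synthetic frame (not the airline data):

```python
import numpy as np
import pandas as pd

def strong_pairs(corr: pd.DataFrame, threshold: float = 0.6) -> list:
    """Return (col_a, col_b, r) for off-diagonal correlations with |r| >= threshold."""
    pairs = []
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Toy frame: 'b' is a noisy copy of 'a'; 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({'a': a,
                    'b': a + 0.1 * rng.normal(size=500),
                    'c': rng.normal(size=500)})
print(strong_pairs(toy.corr(), threshold=0.8))
```

Applied to `dfclean.corr()` with `threshold=0.6`, this would list pairs such as `Departure Delay in Minutes` / `Arrival Delay in Minutes` (0.96) without scanning the heatmap by eye.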
## Distribution of the data
for column in dfclean:
sns.displot(x=column, data=dfclean)
print (dfclean.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 15146 to 77455
Data columns (total 27 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Gender                             30000 non-null  object
 1   Customer Type                      30000 non-null  object
 2   Age                                30000 non-null  int64
 3   Type of Travel                     30000 non-null  object
 4   Class                              30000 non-null  object
 5   Flight Distance                    30000 non-null  int64
 6   Inflight wifi service              30000 non-null  int64
 7   Departure/Arrival time convenient  30000 non-null  int64
 8   Ease of Online booking             30000 non-null  int64
 9   Gate location                      30000 non-null  int64
 10  Food and drink                     30000 non-null  int64
 11  Online boarding                    30000 non-null  int64
 12  Seat comfort                       30000 non-null  int64
 13  Inflight entertainment             30000 non-null  int64
 14  On-board service                   30000 non-null  int64
 15  Leg room service                   30000 non-null  int64
 16  Baggage handling                   30000 non-null  int64
 17  Checkin service                    30000 non-null  int64
 18  Inflight service                   30000 non-null  int64
 19  Cleanliness                        30000 non-null  int64
 20  Departure Delay in Minutes         30000 non-null  int64
 21  Arrival Delay in Minutes           30000 non-null  float64
 22  satisfaction                       30000 non-null  object
 23  GenderNumeric                      30000 non-null  int32
 24  CustomerTypeNumeric                30000 non-null  int32
 25  TypeofTravelNumeric                30000 non-null  int32
 26  ClassNumeric                       30000 non-null  int64
dtypes: float64(1), int32(3), int64(18), object(5)
memory usage: 7.1+ MB
None
The full dataset records over 100,000 passenger survey results (sampled down to 30,000 above). It contains a combination of categorical, ordinal, and continuous variables.
- The F1 measure combines both precision and recall to compute a model's performance: it is the harmonic mean of the precision and recall values. An F1 score of 1 indicates the best possible model performance, while 0 is the worst.
1) If the model predicts unsatisfied/neutral customers as satisfied customers (high false negatives), the recall of our model would be low, and from an airline's standpoint it could be losing revenue while assuming all is well. We want the model to pinpoint unsatisfied/neutral customers so the company focuses on the right priorities to maximize revenue and profit. As a result, we want a high recall score: the model should not label unsatisfied/neutral customers as satisfied.
2) If the model predicts many satisfied customers as unsatisfied/neutral (high false positives), the precision of our model would be low. This may lead the airline to prioritize investments in satisfaction initiatives it does not need, resulting in lower profits. As a result, we want a high precision score to avoid wasted investment: the model should not label satisfied customers as unsatisfied/neutral.
Given that an airline wants to maximize revenue and profit through customer satisfaction, the model must have both a high precision and a high recall score. The F1 score therefore suffices to manage model performance, as both metrics are contained in this measure. To achieve the most optimal model, our model should have the highest F1 score.
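The trade-off described above can be made concrete with a small worked example. The counts below are hypothetical, chosen only to illustrate how precision, recall, and F1 relate:

```python
# Toy confusion-matrix counts (hypothetical, for illustration only)
tp, fp, fn = 80, 10, 30  # true positives, false positives, false negatives

precision = tp / (tp + fp)   # of everything flagged positive, how much was right
recall = tp / (tp + fn)      # of all actual positives, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(round(precision, 3), round(recall, 3), round(f1, 3))  # → 0.889 0.727 0.8
```

Note the harmonic mean pulls F1 toward the weaker of the two metrics, so a model cannot score well on F1 by excelling at only one of precision or recall.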
Before we determine how to train our data, we check whether the dataset is imbalanced.
print(dfclean["satisfaction"].value_counts())
fig = plt.figure(figsize=(10, 5))
dfclean.groupby('satisfaction').size().plot(kind='pie',
y = "satisfaction",
label = "Type",
autopct='%1.0f%%')
neutral or dissatisfied    17045
satisfied                  12955
Name: satisfaction, dtype: int64
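For context on the split above, a useful sanity check is the majority-class baseline: the accuracy a trivial model gets by always predicting the larger class. Any model we train should comfortably beat this. A quick sketch using the counts printed above:

```python
# Class counts taken from the value_counts() output above
counts = {'neutral or dissatisfied': 17045, 'satisfied': 12955}
n = sum(counts.values())
majority_acc = max(counts.values()) / n
print(f"majority-class baseline accuracy: {majority_acc:.3f}")  # ≈ 0.568
```

At roughly 57/43 the dataset is only mildly imbalanced, which is why stratified folds (used below) are a reasonable precaution rather than a necessity.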
# change output to 1 = satisfied and 0 = neutral/unsatisfied
dfclean["satisfaction"] = dfclean["satisfaction"].apply(lambda x: 1 if x == "satisfied" else 0)
We will use repeated (10 times) 10-fold cross validation for our analysis. We selected this method because it enables us to utilize the whole dataset to find the best training model. Below are a few other reasons why we selected 10-fold CV over a single train/test split:
1) More metrics - we learn more about the model and our underlying assumptions. Especially given that we lack domain knowledge of the data, this informs us better about the data itself.
2) Parameter fine-tuning - CV allows us to fine-tune each model, selecting and optimizing its parameters.
3) Avoids overfitting - CV runs the model multiple times. A single 80/20 split assumes that our examples are independent, i.e. that knowing some instances does not help us predict others. With large datasets this may not hold, and a single split may result in overfitting.
For the rest of this lab, we will use cross_val_score from sklearn for the chosen models.
from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
x = dfclean.drop(["satisfaction", "Class", "Gender", "Customer Type", "Type of Travel"],axis=1)
y = dfclean['satisfaction']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
lr_clf = LogisticRegression(penalty='l2', C=1, class_weight=None, solver='liblinear')  # model object
# calculate logistic regression model average F1 scores of cross validation
lr_scores = cross_val_score(lr_clf, x, y, scoring='f1_macro', cv=cv)  # f1_macro, to match the other models
print('Logistic Regression F1 Score of repeated (10 times) 10-fold cross validation: %.3f (%.3f)' % (mean(lr_scores), std(lr_scores)), "\n\n")
Logistic Regression F1 Score of repeated (10 times) 10-fold cross validation: 0.876 (0.006)
# check how C changes prediction accuracy for logistic regression
from sklearn.metrics import accuracy_score
scl_obj = StandardScaler()
scl_obj.fit(x)
x_scaled = scl_obj.transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=.2)
accuracy, params = [], []
for c in np.arange(1, 20):
    C = c * 0.1
    log_linear = LogisticRegression(penalty='l2', C=C, class_weight=None, solver='liblinear')
    log_linear.fit(x_train, y_train)
    y_hat = log_linear.predict(x_test)
    accuracy.append(accuracy_score(y_test, y_hat))
    params.append(C)  # record the actual C value used, not the loop counter
accuracy = np.array(accuracy)
plt.plot(params, accuracy)
plt.ylabel('accuracy of prediction')
plt.xlabel('C')
#plt.xscale('log')
plt.show()
from sklearn.model_selection import GridSearchCV
parameters = {'C':[1, 2, 5, 10, 20, 50]}
log_reg_model = LogisticRegression(max_iter=50000,penalty='l2',class_weight=None,solver='liblinear')
cv_grid = GridSearchCV(log_reg_model, parameters)
cv_grid.fit(x_train, y_train)
cv_grid.best_params_
{'C': 1}
Our sensitivity analysis on C, via both the plot and GridSearchCV, showed that C=1 yielded the best accuracy. Hence the logistic regression model uses C=1 for this lab.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
DT_model = DecisionTreeClassifier()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)
#calculate decision tree model average f1 scores of cross validation
DT_scores = cross_val_score(DT_model, x, y, scoring='f1_macro', cv=cv)
print('Decision Tree F1 Score of repeated (10 times) 10-fold cross validation: %.3f (%.3f)' % (mean(DT_scores), std(DT_scores)), "\n\n")
Decision Tree F1 Score of repeated (10 times) 10-fold cross validation: 0.935 (0.005)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
KNN_model = KNeighborsClassifier(n_neighbors=5)
# KNN is distance-based, so standardize the features first
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
knn_scores = []
for i in list(range(1, 10)):
    knn_loop_model = KNeighborsClassifier(n_neighbors=i)
    scores = cross_val_score(knn_loop_model, x_scaled, y, scoring='f1_macro', cv=cv)
    knn_scores.append(mean(scores))
#graph the f1 scores of the different knns
sns.set()
k_scores = pd.DataFrame()
k_scores['k'] = list(range(1, 10))
k_scores['score'] = knn_scores
sns.scatterplot(data=k_scores, x='k', y='score').set(title='K Values and Scores')
We observed that odd values of K were more optimal, with K=5 the best of those tried. As such we used K=5 for the KNN model.
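One reason odd K tends to behave better in a two-class problem is that it rules out tied votes among the neighbors. A tiny illustration of the voting step (a simplified sketch, not sklearn's actual tie-breaking, which falls back to neighbor ordering):

```python
from collections import Counter

def knn_vote(neighbor_labels):
    """Majority vote among neighbor labels; ties are possible with even k in binary problems."""
    counts = Counter(neighbor_labels).most_common()
    top = [label for label, c in counts if c == counts[0][1]]  # all labels sharing the max count
    return top[0] if len(top) == 1 else 'tie'

print(knn_vote([1, 1, 0, 0]))     # even k=4: a 2-2 tie is possible
print(knn_vote([1, 1, 0, 0, 1]))  # odd k=5: a tie is impossible with 2 classes
```

With an even K, a 50/50 neighborhood forces an arbitrary tie-break, which can make the F1 curve noisier at even values, consistent with the pattern in the scatter plot above.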
#calculate knn model average f1 scores of cross validation
knn_scores = cross_val_score(KNN_model, x_scaled, y, scoring='f1_macro', cv=cv)
print('F1 Score: %.3f (%.3f)' % (mean(knn_scores), std(knn_scores)))
F1 Score: 0.919 (0.005)
Logistic Regression

Pros:
- Makes no assumptions about the distributions of classes in feature space
- Easy to implement and interpret, and very efficient to train
- Model coefficients are easy to interpret as indicators of feature importance

Cons:
- Constructs linear decision boundaries
- Non-linear problems can't be solved with logistic regression because it has a linear decision surface

Decision Tree

Pros:
- Easy to set the parameters of the branches
- Does not require scaling of the data
- Does not require normalization of the data
- Quantifies the values of outcomes and their probabilities

Cons:
- From an industry standpoint, it takes more resources (time and money) to complete
- Moderate time required to train the model on large datasets
- Careful dataset setup is imperative, as parameter changes can result in different outcomes

KNN

Pros:
- No training period, as it only processes the training set at prediction time
- Easy to implement given K and a distance function

Cons:
- Requires scaling of features
- Requires significant resources for large datasets
import math
#count total samples
dfclean_count=len(dfclean)
#1.96 is the z-score at a 95% CI interval
#Logistic Regression 95% Confidence Interval
errorLR = 1 - mean(lr_scores)
varLR = ((1-errorLR) * errorLR)/dfclean_count
sqrtNumLR = math.sqrt(varLR)
LRUpper = (errorLR) + (1.96 * sqrtNumLR)
LRLower =(errorLR) - (1.96 * sqrtNumLR)
#KNN 95% Confidence Interval
errorKNN = 1 - mean(knn_scores)
varKNN = ((1-errorKNN) * errorKNN)/dfclean_count
sqrtNumKNN = math.sqrt(varKNN)
KNNUpper = (errorKNN) + (1.96 * sqrtNumKNN)
KNNLower =(errorKNN) - (1.96 * sqrtNumKNN)
#Decision tree 95% Confidence Interval
errorDT = 1 - mean(DT_scores)
varDT = ((1-errorDT) * errorDT)/dfclean_count
sqrtNumDT = math.sqrt(varDT)
DTUpper = (errorDT) + (1.96 * sqrtNumDT)
DTLower =(errorDT) - (1.96 * sqrtNumDT)
print('1) Decision Tree 95% Confidence Interval: ', DTUpper, DTLower)
print('2) KNN 95% Confidence Interval: ', KNNUpper, KNNLower)
print('3) Logistic Regression 95% Confidence Interval: ', LRUpper, LRLower)
1) Decision Tree 95% Confidence Interval:  0.06732967855094532 0.06176832222570802
2) KNN 95% Confidence Interval:  0.08453001396923618 0.07834007951360225
3) Logistic Regression 95% Confidence Interval:  0.127776835405402 0.12031649792793106
Based on the 95% confidence intervals of the three models, the Decision Tree yielded the best (lowest-error) interval compared to the other two models.
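The three confidence-interval computations above all apply the same normal-approximation formula, error ± 1.96·sqrt(error·(1−error)/n), so they could be folded into one helper. A sketch of that refactor, fed with the rounded mean scores reported earlier (illustrative values, not a rerun):

```python
import math

def error_ci(mean_score: float, n: int, z: float = 1.96):
    """Normal-approximation 95% CI for the error rate 1 - mean_score over n samples."""
    error = 1 - mean_score
    half_width = z * math.sqrt(error * (1 - error) / n)
    return error - half_width, error + half_width

# Mean CV scores as reported above (rounded): DT, KNN, LR
for name, score in [('Decision Tree', 0.935), ('KNN', 0.919), ('Logistic Regression', 0.876)]:
    lo_b, hi_b = error_ci(score, n=30000)
    print(f"{name} 95% CI on error: ({lo_b:.4f}, {hi_b:.4f})")
```

Keeping the formula in one place makes it easy to change z or n consistently for all three models.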
- Feature 0: Age
- Feature 1: Flight Distance
- Feature 2: Inflight wifi Service
- Feature 3: Departure/Arrival time convenient
- Feature 4: Ease of Online Booking
- Feature 5: Gate Location
- Feature 6: Food and Drink
- Feature 7: Online Boarding
- Feature 8: Seat Comfort
- Feature 9: Inflight Entertainment
- Feature 10: On-board Service
- Feature 11: Leg room service
- Feature 12: Baggage handling
- Feature 13: Checkin Service
- Feature 14: Inflight Service
- Feature 15: Cleanliness
- Feature 16: Departure Delay in Minutes
- Feature 17: Arrival Delay in Minutes
- Feature 18: Gender Numeric Val
- Feature 19: Customer Type Numeric
- Feature 20: Type of Travel Numeric
- Feature 21: Class Numeric
dt_clf = DecisionTreeClassifier()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)
dt_clf.fit(x_train, y_train)
yhat = dt_clf.predict(x_test)
print('accuracy:', mt.accuracy_score(y_test, yhat))
# get the feature importances and print them
DT_important = dt_clf.feature_importances_
for i, v in enumerate(DT_important):
    print('Feature: %0d, Score: %.5f' % (i, v))
accuracy: 0.9338333333333333
Feature: 0, Score: 0.02840
Feature: 1, Score: 0.02195
Feature: 2, Score: 0.18038
Feature: 3, Score: 0.00510
Feature: 4, Score: 0.00419
Feature: 5, Score: 0.01455
Feature: 6, Score: 0.00498
Feature: 7, Score: 0.35105
Feature: 8, Score: 0.01347
Feature: 9, Score: 0.04714
Feature: 10, Score: 0.00997
Feature: 11, Score: 0.01219
Feature: 12, Score: 0.01672
Feature: 13, Score: 0.03428
Feature: 14, Score: 0.01472
Feature: 15, Score: 0.01453
Feature: 16, Score: 0.00512
Feature: 17, Score: 0.00865
Feature: 18, Score: 0.00473
Feature: 19, Score: 0.03746
Feature: 20, Score: 0.14977
Feature: 21, Score: 0.02065
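Rather than matching feature indices against the legend by hand, the importances can be paired with names and sorted. A sketch using a subset of the scores transcribed from the output above:

```python
import pandas as pd

# Top importances transcribed from the decision-tree output above (subset, for illustration)
imp = {
    'Online boarding': 0.35105,          # Feature 7
    'Inflight wifi service': 0.18038,    # Feature 2
    'TypeofTravelNumeric': 0.14977,      # Feature 20
    'Inflight entertainment': 0.04714,   # Feature 9
    'CustomerTypeNumeric': 0.03746,      # Feature 19
}
ranked = pd.Series(imp).sort_values(ascending=False)
print(ranked)
# In the notebook, the full ranking could come directly from the fitted model:
# pd.Series(dt_clf.feature_importances_, index=x.columns).sort_values(ascending=False)
```

A horizontal bar chart of `ranked` (e.g. `ranked.plot(kind='barh')`) makes the dominance of Online boarding and Inflight wifi service immediately visible.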
Based on the travel demographics, most surveys in the dataset came from business travelers; the random samples consistently pull almost double the number of business travelers compared to recreational travelers. The most influential factors were determined to be Inflight wifi service, Online boarding, and traveler type.
The airline industry would find this analysis interesting. It enables airline companies to decide where to allocate investment to maintain or improve their current customer satisfaction and thereby ensure optimal revenue and profit. Although our dataset identified business travelers, wifi service, and online boarding as key factors, additional data is required to understand not only one's own business but also the competitive landscape. Adding data on which airline was flown would help inform how a company compares to its competition. Depending on its strategy, a company can also assess whether the survey results reflect its mission or vision. For example, if you are a discount airline, do your customers really care about the bells and whistles an airline can offer? The data can be used to focus on the attributes that matter to the company and to validate its strategy. Customer survey data is very useful for assessing the health of the business with the customer, and these models can help pinpoint where to focus a company's attention.
Work embedded in work above.